Metropolis

The Metropolis algorithm is the simplest Markov chain Monte Carlo (MCMC) algorithm. Even though it is rarely used anymore because it is not very efficient, it is useful for explaining the key components of MCMC algorithms.

The goal of an MCMC algorithm is to generate samples from a posterior distribution, i.e. we want to calculate how probable parameter values are given the data, the likelihood and the prior.

To introduce the Metropolis algorithm, we use a simple example where we try to estimate the mean and standard deviation of these data:

set.seed(1)
y = rnorm(100)
hist(y)

The log posterior

As a first step, we need a function that calculates the (unnormalized) log probability of the parameters given the data, likelihood and priors. Here, we use the by now hopefully well-known ingredients:

calc.lp = function(mu, log_sigma, x) {
  return(
    dnorm(x, mu, exp(log_sigma), log = TRUE) %>% sum() + # likelihood
      dnorm(mu, mean = 0, sd = 1, log = TRUE) +          # prior for the mean
      dnorm(log_sigma, mean = 0, sd = 1, log = TRUE)     # prior for log(sigma)
  )
}

As is generally done in Bayesian computation, we do the computations on the log scale, so multiplication becomes addition.
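A small illustration of why the log scale also matters numerically: multiplying many small likelihood values underflows to zero in floating point, while summing their logs stays finite.

```r
set.seed(1)
dens = dnorm(rnorm(1000))   # 1000 standard normal density values
prod(dens)                  # underflows to 0 on the raw scale
sum(log(dens))              # the log scale stays finite
```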

(Putting a normal prior on log_sigma and exponentiating later is slightly unusual. The reason this is done here is that sampling on the log scale lets us keep symmetric proposals for the error variance.)

To start the sampling process, we set some initial parameter values. We store the initial values in vectors that will also hold our posterior samples. The vector lp holds the log posteriors.

# number of samples we want to draw
iter = 4000
# vectors to store simulation results
post.mu = vector(length = iter)
post.log_sigma = vector(length = iter)
lp = vector(length = iter)
# initialize parameter values (usually done randomly)
post.mu[1] = -2
post.log_sigma[1] = 1.5
# calculate log posterior
lp[1] = calc.lp(post.mu[1], post.log_sigma[1], y)

Making proposals

Proposals for the Metropolis algorithm have to be symmetric around the current parameter values. Here we use a normal distribution with a standard deviation of 0.05.

This standard deviation is an important tuning parameter: if it is too small and we initialize far away from the posterior distribution, the algorithm will take a lot of time to find the posterior distribution. On the other hand, if the standard deviation is too large, the algorithm will make lots of proposals that are rejected, which also slows things down.

One rule of thumb is to set the standard deviation such that around 80% of the proposals are accepted. This is partially a process of trial and error, though it helps to choose reasonable priors and to choose standard deviations for proposals that are consistent with the priors. Luckily, modern MCMC samplers are tuned automatically.
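To illustrate this trade-off, here is a self-contained sketch (it jumps ahead to the full sampling loop we build below, and re-creates the data and log posterior from above) that measures the acceptance rate for different proposal standard deviations:

```r
# Sketch: acceptance rate of a short Metropolis run for different
# proposal standard deviations; data and log posterior as above.
set.seed(1)
y = rnorm(100)
calc.lp = function(mu, log_sigma, x)
  sum(dnorm(x, mu, exp(log_sigma), log = TRUE)) +
    dnorm(mu, 0, 1, log = TRUE) +
    dnorm(log_sigma, 0, 1, log = TRUE)

acceptance.rate = function(proposal.sd, iter = 2000) {
  mu = 0; log_sigma = 0
  lp = calc.lp(mu, log_sigma, y)
  accepted = 0
  for (k in 1:iter) {
    p.mu = rnorm(1, mu, proposal.sd)
    p.ls = rnorm(1, log_sigma, proposal.sd)
    p.lp = calc.lp(p.mu, p.ls, y)
    if (runif(1) < exp(p.lp - lp)) {    # Metropolis acceptance rule
      mu = p.mu; log_sigma = p.ls; lp = p.lp
      accepted = accepted + 1
    }
  }
  accepted / iter
}
round(sapply(c(0.01, 0.1, 1), acceptance.rate), 2)
# tiny proposals are almost always accepted (but move slowly),
# large proposals are mostly rejected
```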

i = 2 # current iteration
proposal.mu = 
  rnorm(1, 
        mean = post.mu[i-1],
        sd = 0.05)
proposal.log_sigma = 
  rnorm(1, 
        mean = post.log_sigma[i-1], 
        sd = 0.05)

We could also choose different standard deviations for the different parameters.

Adding a sample

Because our goal is to describe a posterior distribution, we should preferentially keep samples that have a high log posterior.

On the other hand, we also want to explore the space of possible parameters and should therefore also accept samples with lower log posteriors. We do not simply want to find the maximum a posteriori, i.e. the parameter combination at which the log posterior is highest; we want to know the (relative) probability of parameter combinations.

The decision about whether the proposal is accepted as the new sample depends on the relative log posteriors of the old sample and the proposal, i.e. it depends on whether the data are more or less likely under the proposal compared to the previous sample. So we next calculate the log posterior for the proposal.

proposal.lp = calc.lp(proposal.mu, proposal.log_sigma, y)

The decision rule for accepting the proposal as the next sample is as follows:

  • If the log posterior of the proposal is higher, always accept the proposal as the new sample
  • If the log posterior of the proposal is lower, randomly choose between last sample and proposal
    • The log posteriors determine the probability of choosing the proposal or old sample: The smaller the log posterior of the proposal is compared to that of the last sample, the smaller is the probability to choose the proposal.

This was the decision rule in words, here it is in code:

First, we want to know what is the relative probability of the data under the proposal, compared to under the last sample.

c(log_post.last_sample = lp[1], log_post.proposal = proposal.lp)
## log_post.last_sample    log_post.proposal 
##            -259.9161            -260.4642

Because we are on the log scale, we subtract and then take the exponent:

exp(proposal.lp-lp[1])
## [1] 0.578066

So the probability of the proposal is 0.58 times the probability of the last sample.

Because the proposal has a lower probability, we choose randomly, i.e. using a random number between 0 and 1:

r = runif(1)
accept = 
  ifelse(proposal.lp > lp[1], TRUE,
         ifelse(r < exp(proposal.lp-lp[1]), TRUE, FALSE))
accept
## [1] TRUE

In this case we accept the proposal.

Now that we have decided about the first sample, we generate a new proposal to generate the next sample.

The next code block shows how to generate 4000 samples.

last.lp = proposal.lp
for (k in 2:iter) {
  # generate proposal and calculate log_posterior
  proposal.mu = rnorm(1,mean = post.mu[k-1], sd = .05)
  proposal.log_sigma = rnorm(1, mean = post.log_sigma[k-1], sd = .05)
  proposal.lp = calc.lp(proposal.mu, proposal.log_sigma, y)
  # acceptance probability
  acceptance = min(1, exp(proposal.lp-last.lp))
  # acceptance decision rule
  if (acceptance >= 1) {
    post.mu[k] = proposal.mu
    post.log_sigma[k] = proposal.log_sigma
    last.lp = proposal.lp
  } else {
    if (runif(1) < acceptance) {
      post.mu[k] = proposal.mu
      post.log_sigma[k] = proposal.log_sigma
      last.lp = proposal.lp
    } else {
      post.mu[k] = post.mu[k-1]
      post.log_sigma[k] = post.log_sigma[k-1]
    }
  }
}

And here is the result of this process:

Metropolis algorithm visualized. Red dots are rejected proposals

Did we learn the correct mean and standard deviation?

We can plot the posterior distribution, excluding the first 2000 burn-in samples during which the sampler was still finding the posterior distribution.

par(mar=c(3,3,2,.5), mgp=c(1.75,.5,0), tck=-.01)
layout(matrix(c(1,3,1,3,2,4), nrow = 2))
posterior.histogram = function(x,xlab) {
  plot(1:2000,x, 'l', ylab = xlab, xlab = "iteration",
       main = paste0("traceplot ",xlab))
  hist(x, main = paste0("mean = ", round(mean(x),2)), xlab = xlab)
}
posterior.histogram(post.mu[-(1:2000)],"mu")
posterior.histogram(exp(post.log_sigma[-(1:2000)]),"sigma")

And to double check, here are arithmetic mean and sd:

c(mean = mean(y), sd = sd(y))
##      mean        sd 
## 0.1088874 0.8981994

While this was a trivial example, it shows us that MCMC, here in the form of the Metropolis algorithm, can compute posterior distributions.

With an infinite number of posterior samples the simple Metropolis algorithm can recover any posterior. But we cannot generate an infinite number of samples.

Why Hamiltonian Monte Carlo?

The Metropolis algorithm and improved versions like the Gibbs sampler generally perform well, but they run into problems when parameters are correlated.

Take for example these data with high collinearity, akin to the 2-legs example:

set.seed(1)
Sigma = matrix(c(1,.95,.95,1), nrow = 2)
X = MASS::mvrnorm(100,mu = c(0,0), Sigma = Sigma)
Y = X %*% c(1,1) + rnorm(100)
XY = cbind(X1 = X[,1], X2 = X[,2], Y = Y)
colnames(XY)[3] = "Y"
par(mar=c(3,3,2,1), mgp=c(1.25,.5,0), tck=-.01)
pairs(XY)

If we estimate a model \(\small Y \sim Normal(\beta_1 X_1 \ + \beta_2 X_2, \sigma)\) the coefficients \(\small \beta_1\) and \(\small \beta_2\) are highly correlated.
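As a quick check (a sketch using an ordinary least-squares fit rather than the Bayesian model), the correlation matrix of the coefficient estimates makes this visible:

```r
# With collinear predictors, the estimates of b1 and b2 are
# strongly negatively correlated: the data cannot separate them.
set.seed(1)
Sigma = matrix(c(1, .95, .95, 1), nrow = 2)
X = MASS::mvrnorm(100, mu = c(0, 0), Sigma = Sigma)
y = as.vector(X %*% c(1, 1) + rnorm(100))
fit = lm(y ~ X)
round(cov2cor(vcov(fit)), 2)   # off-diagonal for b1, b2 close to -0.95
```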

Let’s see what the Metropolis sampler does with this:

Metropolis sampling for a hard problem

We can observe the following: the chains move only slowly along the narrow diagonal ridge of the posterior, and many proposals are rejected.

This is just one example where Metropolis or Gibbs samplers can have difficulties.

Fitting models with Stan and ulam

One sampling algorithm that does much better than Metropolis or Gibbs sampling is Hamiltonian Monte Carlo. On an intuitive level, the key difference in favor of Hamiltonian Monte Carlo is that it generates proposals less randomly than either Metropolis or Gibbs. Here, less randomly means that the direction from the current sample to the proposal is not random; proposals are more likely in the direction of the bulk of the posterior distribution. See here for animations of different model estimation algorithms.
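To make this intuition concrete, here is a minimal sketch of Hamiltonian proposals. This is not Stan's actual, far more sophisticated NUTS implementation; the correlated 2-d normal below simply stands in for a posterior like that of b1 and b2:

```r
# Minimal HMC sketch: the momentum carries the proposal along the
# gradient of the log posterior instead of in a random direction.
log.post = function(th, P) -0.5 * sum(th * (P %*% th))  # MVN log density (up to a constant)
grad.log.post = function(th, P) as.vector(-P %*% th)

hmc.step = function(theta, P, eps = 0.1, L = 20) {
  m0 = rnorm(length(theta))                 # random initial momentum
  th = theta; m = m0
  for (l in 1:L) {                          # leapfrog integration
    m  = m + eps/2 * grad.log.post(th, P)
    th = th + eps * m
    m  = m + eps/2 * grad.log.post(th, P)
  }
  # Metropolis correction based on the change in total energy
  log.r = (log.post(th, P) - 0.5*sum(m^2)) -
          (log.post(theta, P) - 0.5*sum(m0^2))
  if (log(runif(1)) < log.r) th else theta
}

set.seed(1)
P = solve(matrix(c(1, .95, .95, 1), 2))     # precision of the correlated target
theta = c(-2, 2)
draws = matrix(NA, 500, 2)
for (i in 1:500) draws[i, ] = theta = hmc.step(theta, P)
round(cor(draws[, 1], draws[, 2]), 2)       # the sampler traces the ridge
```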

To clarify the relationship of the different terms: Hamiltonian Monte Carlo (and its NUTS variant) is the sampling algorithm, Stan is the software that implements it, and ulam is the rethinking function that translates our model definitions into Stan code.

Fitting in ulam

We first put the data for the model together:

data.list = list(
  Y = as.vector(Y),
  X1 = X[,1],
  X2 = X[,2]
)

Note that, unlike for quap models, we need to do all transformations before we submit the data to ulam. It is OK to submit the data as a data.frame or a list. The latter is more flexible, which is why we use it.

Next we define the rethinking model:

model = alist(
  Y ~ normal(mu,exp(log_sigma)),
  mu <- a + b1*X1 + b2*X2,
  a ~ dnorm(0,1),
  b1 ~ dnorm(0,1),
  b2 ~ dnorm(0,1),
  log_sigma ~ dnorm(0,1)
)

This is exactly the same model that we also fit above with Metropolis sampling.

Until here, everything was as we know it from quap models. But now we use the ulam function, which requires a few additional parameters because we are now actually doing simulations.

u.fit = ulam(
  model,
  data = data.list,
  iter = 2000,      # 2000 iterations, (1000 warmup)
  chains = 4,       # four chains, to check convergence
  cores = 4,        # use 4 cores in parallel
  cmdstan = TRUE)   # use cmdstanr not rstan

Here is the how HMC (via ulam and Stan) explores the posterior distribution:

HMC via Stan and ulam for hard problem

Maybe the comparison to the Metropolis sampler is not so clear, so let's look at both together:

Metropolis vs HMC

A healthy chain

MCMC algorithms produce chains of posterior samples, in which each sample depends on the previous one. A healthy chain is a chain that fulfills certain criteria such that we can be (relatively) certain that statistics we calculate from the chain (like mean, sd, quantiles …) accurately describe the posterior distribution.

How do we know that the results from ulam are better than the results from the Metropolis sampling scheme?

First an example: This is an ensemble of 4 healthy chains:

trace.rankplot = function(chains) {
  par(mfrow = c(2,1), mar=c(3,3,0.1,.5), mgp=c(1.75,.5,0), tck=-.01)
  matplot(chains, typ = "l", lty = 1, ylab = "theta", xlab = "iteration") 
  ranks = matrix(rank(chains), ncol = ncol(chains))
  h = hist(ranks, plot = F, breaks = 30)
  rank.hists = 
    apply(ranks,2, 
          function(x) counts = hist(x,breaks = h$breaks, plot = F)$counts)
  par(mar=c(3,3,0,.5))
  matplot(h$mids,rank.hists,'s', ylim = c(0,max(rank.hists)),lty = 1, 
          xlab = "sample rank over all chains", ylab = "frequency")
}
chains.well.mixing = matrix(rnorm(4000),ncol = 4)
trace.rankplot(chains.well.mixing)

These chains are healthy because they are

  • stationary
  • well mixing
  • converged to a common mean

One always needs multiple chains to properly evaluate if a model estimation converged.

Stationarity

Stationarity means that the chains vary around a stable mean, which one can imagine as a horizontal line in a traceplot.

Here is an example of non-stationary chains:

AR = cbind(sin(seq(1,2*pi,length.out = 1000)),
           sin(0.5+seq(1,3*pi,length.out = 1000)),
           sin(1.5+seq(1,3*pi,length.out = 1000)),
           sin(2.5+seq(1,3*pi,length.out = 1000)))
chains.AR =
  chains.well.mixing/4 + 
  AR
trace.rankplot(chains.AR)  

These chains exhibit autocorrelation: the samples i-1, i-2, i-3 … can be used to predict the parameter value at sample i. This is undesirable because it means that the samples are not independent.
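The cost of such autocorrelation can be sketched with a rough effective-sample-size estimate (the posterior package uses a more careful version of this idea):

```r
# Rough effective sample size: N / (1 + 2 * sum of autocorrelations),
# truncating the autocorrelations at the first negative lag.
rough.ess = function(x, max.lag = 100) {
  rho = acf(x, lag.max = max.lag, plot = FALSE)$acf[-1]  # drop lag 0
  neg = which(rho < 0)[1]
  if (!is.na(neg)) rho = rho[seq_len(neg - 1)]
  length(x) / (1 + 2 * sum(rho))
}
set.seed(1)
independent = rnorm(1000)
correlated  = as.vector(arima.sim(list(ar = 0.9), n = 1000))  # AR(1) chain
rough.ess(independent)   # close to 1000
rough.ess(correlated)    # far fewer effective samples
```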

Good mixing

Good mixing means that the samples vary with a constant variance around their mean.

Here is an example of chains that are not well mixed:

chains.not.mixing = 
  matrix(rnorm(4000,
               sd = as.vector(AR)^2),
         ncol = 4)
trace.rankplot(chains.not.mixing)  

Convergence

Convergence means that all chains converged to the same mean. Here is an example of chains that did not converge:

offsets = cbind(rep(-.25,1000),
                rep(0,1000),
                rep(.25,1000),
                rep(.5,1000))
chains.not.converged = 
  chains.well.mixing/4 + 
  offsets
trace.rankplot(chains.not.converged)  

Evaluating chains

Convergence diagnostics

By convergence we mean that multiple chains that are initialized at different parameter values should converge to the same posterior distribution. MCMC sampling software like Stan by default uses multiple (4) chains that are initialized at random parameter values.

Here is a slightly unorthodox figure that shows this for the parameters b1 and b2 from our “hard” analysis with collinear predictors:

Convergence diagnostics use both plots and statistics.

Most typically, one shows one traceplot per parameter:

traceplot_ulam(
  u.fit,
  pars = c("b1","b2"),
  n_cols = 1,
  max_rows = 2)

The warmup samples have grey background and are not part of the posterior distribution. While we can eyeball traceplots to see if the chains have converged, it is better to have some statistic that makes this more objective.

The key statistic here is Rhat, a number that, roughly speaking, compares the variance within chains to the variance between chains. If the chains converge to different means, the variance between chains becomes larger than the variance within chains and the Rhat value rises above 1.01, the currently recommended threshold.1 The precis function shows this and other values:

precis(u.fit) %>% round(3)
##            mean    sd   5.5% 94.5%    n_eff Rhat4
## a         0.027 0.106 -0.141 0.194 3350.598 1.000
## b1        1.140 0.311  0.648 1.630 1759.062 1.003
## b2        0.868 0.313  0.375 1.365 1742.313 1.003
## log_sigma 0.045 0.073 -0.068 0.164 2608.989 1.000
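To build intuition for what Rhat measures, here is a hand-rolled sketch of the basic (non-split) version; the statistic reported by Stan is the more robust split-Rhat:

```r
# Basic Rhat sketch: compare between-chain and within-chain variance.
simple.rhat = function(chains) {     # chains: iterations x chains matrix
  n = nrow(chains)
  W = mean(apply(chains, 2, var))    # mean within-chain variance
  B = n * var(colMeans(chains))      # between-chain variance
  var.hat = (n - 1)/n * W + B/n      # pooled variance estimate
  sqrt(var.hat / W)
}
set.seed(1)
converged = matrix(rnorm(4000), ncol = 4)
shifted   = converged + rep(c(-1, 0, 1, 2), each = 1000)  # chains with different means
simple.rhat(converged)   # close to 1
simple.rhat(shifted)     # well above the 1.01 threshold
```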

We can use our examples above to look at Rhat values for “bad” chains:

library(bayesplot)
library(posterior)

nms = c("mixing","not mixing", "AR", "not converging")

test.posteriors = 
  array(NA, dim = c(1000,4,4), 
        dimnames = list(iter = 1:1000, chain = 1:4, variable = nms))
test.posteriors[,,1] = chains.well.mixing
test.posteriors[,,2] = chains.not.mixing
test.posteriors[,,3] = chains.AR
test.posteriors[,,4] = chains.not.converged

par(mfrow = c(2,2), mar=c(2,3,2,.5), mgp=c(1.75,.5,0), tck=-.01)
for (j in 1:4) {
  matplot(test.posteriors[,,j],  
          typ = "l", lty = 1, ylab = "theta", xlab = "") 
  title(dimnames(test.posteriors)$variable[j])
}

test.posteriors %>% 
  summarise_draws() %>% 
  kable(digits = 2) %>% 
  kable_styling(full_width = FALSE)
variable        mean median   sd  mad    q5  q95 rhat ess_bulk ess_tail
mixing          0.00  -0.01 1.04 1.04 -1.71 1.71 1.00  3926.85  3990.31
not mixing     -0.01   0.00 0.62 0.35 -1.15 1.11 1.01  3903.35   712.91
AR             -0.07  -0.12 0.76 0.96 -1.16 1.13 1.18    17.94   169.91
not converging  0.12   0.13 0.38 0.41 -0.51 0.74 1.50     7.42    50.15

I have used the summarise_draws function from the posterior package, which also shows the effective sample size (ess; this is the same as n_eff) for the bulk (roughly the central part of the posterior distribution) and for the tail of the posterior distribution.

We see that Rhat increases and the effective sample size drops primarily when chains are non-stationary or do not converge.

The effective sample size (n_eff or ess) can be understood as the number of independent samples in the posterior. It depends on the autocorrelation of the chains and on the total number of posterior samples. One should not just increase the number of iterations when the effective sample size is low. Rather, one should check if the model can be improved through better priors and/or a re-parameterization.

Rhat, ess/n_eff, number of chains

  • We need multiple chains to check if the sampler converges to the same posterior
  • Rhat is the key convergence diagnostic
  • The effective number of samples depends on (a) the autocorrelation of the chains and (b) the number of posterior samples. It tells us how much we can trust statistics we calculate from the posterior. As a rule of thumb, a few hundred samples are enough. If we are mainly interested in the mean, we should look at the n_eff / ess for the bulk; if we are interested in the tail (e.g. when we are looking at cut-off values in the tail region) we should look at the n_eff / ess for the tail.
  • Multiple chains are useful because they increase the number of posterior samples and thus contribute to ess / n_eff. More importantly, they allow us to detect convergence problems if sampling starts at different random points for the different chains.
  • Typically 4 chains are sufficient. Fewer chains reduce the chance to detect problems. More chains can be useful to detect problems in tricky models. (for instance models with multi-modal posteriors)
  • In sum: Go with defaults!
    • 4 chains
    • 1000 warmup and 1000 post warm up iterations
    • check if Rhat <= 1.01
    • check if relevant ess / n_eff > 500

Bad models make bad posteriors

Andrew Gelman has described The Folk Theorem of Statistical Computing, which says that if you have computational problems with your model, e.g. divergent transitions or non-converging chains, the culprit is usually not the algorithm but a suboptimally specified model.

The reasoning is that modern samplers like Stan’s modified NUTS Hamiltonian Monte Carlo sampler are so good that they can fit almost any model, and that therefore problems during model estimation are not due to the sampler but due to a model that is not specified well.

One important part of model specification are the priors.

To see this, let's look at the problem of collinearity and specify an extremely collinear situation, where we use the same predictor twice to predict an outcome:

data.list = list(
  Y = as.vector(Y),
  X1 = X[,1]
)

If we specify a model \(\small y \sim normal(a + \beta_1 X_1 + \beta_2 X_1, \sigma)\):

wide.model = alist(
  Y ~ normal(mu,exp(log_sigma)),
  mu <- a + b1*X1 + b2*X1,
  a ~ dnorm(0,1),
  b1 ~ dnorm(0,1000),
  b2 ~ dnorm(0,1000),
  log_sigma ~ dnorm(0,1)
)

then this model is non-identified because e.g. the predictions for \(\small \beta_1 = 1, \beta_2 = 1\) and \(\small \beta_1 = 0, \beta_2 = 2\) will be the same. In fact there is an infinite number of combinations of values for \(\small \beta_1\) and \(\small \beta_2\) that make identical predictions.
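A tiny numeric check makes this concrete: any pair of coefficients with the same sum makes identical predictions when the same predictor enters twice.

```r
# Two different coefficient pairs, identical predictions:
set.seed(1)
x1 = rnorm(10)
mu.a = 1.0 * x1 + 1.0 * x1   # b1 = 1,   b2 = 1
mu.b = 0.5 * x1 + 1.5 * x1   # b1 = 0.5, b2 = 1.5
all.equal(mu.a, mu.b)        # the likelihood cannot tell them apart
```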

Let's see if we can still fit this model:

u.fit.wide = ulam(
  wide.model,
  data = data.list,
  iter = 2000,      # 2000 iterations (1000 warmup)
  chains = 4,       # four chains, to check convergence
  cores = 4,        # use 4 cores in parallel
  cmdstan = TRUE)   # use cmdstanr not rstan

The warnings indicate computational problems. What do the traceplots look like?

traceplot_ulam(u.fit.wide, pars = c("b1","b2"), n_cols = 1)

The chains are obviously non-stationary and have not converged.

Here is the model summary:

precis(u.fit.wide) %>% round(2)
##              mean     sd     5.5%   94.5% n_eff Rhat4
## a            0.01   0.11    -0.17    0.17 31.23  1.10
## b1         317.02 614.81  -625.16 1374.16  2.46  3.06
## b2        -315.05 614.81 -1372.24  627.13  2.46  3.06
## log_sigma    0.08   0.08    -0.07    0.20 28.83  1.06

The pairs plot is useful to show correlations between parameters, which explains our problem.

pairs(u.fit.wide)

This model clearly did not work out. Let's look at the model again and think about how we could improve it. We start with prior predictions:

prior = extract_prior_ulam(u.fit.wide)
mu = link_ulam(
  u.fit.wide,
  data = list(X1 = range(data.list$X1)),
  post = prior,
  n = 50)
par(mfrow = c(1,2), mar=c(3,3,1,.5), mgp=c(1.75,.5,0), tck=-.01)
plot(X[,1],Y, xlab = "X1")
matplot(range(data.list$X1),
        t(mu[1:50,]), 'l',
        ylab = "Y_hat",
        col = adjustcolor("blue", alpha = .25),
        lty = 1, xlab = "X1")

The left panel shows observed Y and X1 values. The right panel shows that the prior for beta allows slopes that are orders of magnitude steeper than is reasonable. Let's look at the priors again:

wide.model
## [[1]]
## Y ~ normal(mu, exp(log_sigma))
## 
## [[2]]
## mu <- a + b1 * X1 + b2 * X1
## 
## [[3]]
## a ~ dnorm(0, 1)
## 
## [[4]]
## b1 ~ dnorm(0, 1000)
## 
## [[5]]
## b2 ~ dnorm(0, 1000)
## 
## [[6]]
## log_sigma ~ dnorm(0, 1)

The priors on b1 and b2 are indeed very wide. Note that a 4 unit change on the x axis corresponds to a 12 unit change on the y axis, and this slope is shared between the two parameters b1 and b2. 12/4/2 = 1.5, so even if we set the prior sd for b1 and b2 to 15, we are still using very wide priors (though not ridiculously wide). Let's try this model:

less.wide.model = alist(
  Y ~ normal(mu,exp(log_sigma)),
  mu <- a + b1*X1 + b2*X1,
  a ~ dnorm(0,1),
  b1 ~ dnorm(0,15),
  b2 ~ dnorm(0,15),
  log_sigma ~ dnorm(0,1))

if (file.exists("u.fit.less.wide.Rdata")) {
  load("u.fit.less.wide.Rdata")
} else {
 u.fit.less.wide = ulam(
  less.wide.model,
  data = data.list,
  iter = 2000,      # 2000 iterations (1000 warmup)
  chains = 4,       # four chains, to check convergence
  cores = 4,        # use 4 cores in parallel
  cmdstan = TRUE)   # use cmdstanr not rstan
  save(u.fit.less.wide,file = "u.fit.less.wide.Rdata")
}


traceplot(u.fit.less.wide, max_rows = 1, pars = c("b1","b2"), n_cols = 1)
## Waiting to draw page 2 of 2

precis(u.fit.less.wide) %>% round(2)
##           mean    sd   5.5% 94.5%   n_eff Rhat4
## a         0.02  0.11  -0.15  0.19 2236.35     1
## b1        0.91 10.49 -15.68 17.80 1547.19     1
## b2        1.06 10.49 -15.81 17.69 1547.57     1
## log_sigma 0.07  0.07  -0.04  0.18 2238.94     1

This is much better, and shows us that the initial problem was not due to problems with the sampler, but due to problems with the model definition.

Why does changing the prior allow us to estimate the model?

For the model \[ \small y \sim normal(a + \beta_1 X_1 + \beta_2 X_1, \sigma) \]

the likelihood will be the same for e.g. \(\small \beta_1 = 1, \beta_2 = 1\) and \(\small \beta_1 = 0, \beta_2 = 2\).

Now lets look at the posterior probability under a normal prior with sd = 10:

c(dnorm(c(1,1), sd = 10, log = T) %>% sum(),
  dnorm(c(0,2), sd = 10, log = T) %>% sum())
## [1] -6.453047 -6.463047
exp(dnorm(c(1,1), sd = 10, log = T) %>% sum()-
      dnorm(c(0,2), sd = 10, log = T) %>% sum())
## [1] 1.01005

The two parameter combinations are more or less equally probable under this wide prior, so a proposal where both betas are moderate does not have a much higher acceptance probability than a proposal where one parameter has a very small value and the other has a large value.

Now we try again with a narrower prior:

c(
  dnorm(c(1,1), sd = 1, log = T) %>% sum(),
  dnorm(c(0,2), sd = 1, log = T) %>% sum())
## [1] -2.837877 -3.837877
exp(dnorm(c(1,1), sd = 1, log = T) %>% sum()-
      dnorm(c(0,2), sd = 1, log = T) %>% sum())
## [1] 2.718282

Now the advantage of the parameter combination with two moderate values is clear and the posterior is easy to find. This is the intuition for why Bayesian models generally have fewer problems with collinear models that are not or only weakly identified in the frequentist approach.

Take home messages from exercises M1 and M2

In M1 the standard deviation can be estimated well with an exponential or uniform prior, because both allow all values above 0. The exponential distribution implements a “soft constraint” by making values over 10 very unlikely, but still possible.

par (mar=c(3,3,2,1), mgp=c(2,.7,0), tck=-.01)
curve(dexp(x,1),0,10, ylab = "Prior density", xlab = "sigma")
curve(dunif(x,0,10),0,10, add = TRUE, col = "red")

In M2 b can no longer be estimated well, because the exponential prior implements a hard constraint that makes values below 0 impossible.

dexp(-.001,.3)
## [1] 0
par (mar=c(3,3,0,1), mgp=c(2,.7,0), tck=-.01, new = FALSE)
curve(dnorm(x,0,1),-3,10, ylab = "Prior density", xlab = "b")
curve(dexp(x,.3),add = TRUE, col = "red", n = 10000)


  1. As with all thresholds, this should be seen in context. An Rhat value of 1.1 can be OK; an Rhat value > 1.1 is not OK.↩︎